{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Deploy LLM with Ray Serve LLM\n",
    "\n",
    "This guide walks you through deploying a large language model (LLM) using Ray Serve LLM. It covers configuration, deployment, and interaction with the model. \n",
    "\n",
    "The example maintains compatibility with the OpenAI API while leveraging Ray Serve LLM’s powerful features for production-grade deployments.\n",
    "\n",
    "For more details, see the Ray Serve LLM API documentation: https://docs.ray.io/en/latest/serve/llm/serving-llms.html\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-warning\">\n",
    "  <b>Anyscale-Specific Configuration</b>\n",
    "  \n",
    "  <p>Note: This tutorial is optimized for the Anyscale platform. When running on open source Ray, additional configuration is required. For example, you’ll need to manually:</p>\n",
    "  \n",
    "  <ul>\n",
    "    <li>\n",
    "      <b>Configure your Ray Cluster:</b> Set up your multi-node environment (including head and worker nodes) and manage resource allocation (e.g., autoscaling, GPU/CPU assignments) without the Anyscale automation. See the Ray Cluster Setup documentation for details: <a href=\"https://docs.ray.io/en/latest/cluster/getting-started.html\">https://docs.ray.io/en/latest/cluster/getting-started.html</a>.\n",
    "    </li>\n",
    "    <li>\n",
    "      <b>Manage Dependencies:</b> Install and manage dependencies on each node since you won’t have Anyscale’s Docker-based dependency management. Refer to the Ray environment dependencies guide for instructions on installing and managing dependencies in your environment: <a href=\"https://docs.ray.io/en/latest/ray-core/handling-dependencies.html\">https://docs.ray.io/en/latest/ray-core/handling-dependencies.html</a>.\n",
    "    </li>\n",
    "    <li>\n",
    "      <b>Set Up Storage:</b> Configure your own distributed or shared storage system (instead of relying on Anyscale’s integrated cluster storage). See the Ray Train persistent storage guide for suggestions on setting up shared storage solutions: <a href=\"https://docs.ray.io/en/latest/train/user-guides/persistent-storage.html\">https://docs.ray.io/en/latest/train/user-guides/persistent-storage.html</a>.\n",
    "    </li>\n",
    "  </ul>\n",
    "\n",
    "</div>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Dependencies\n",
    "\n",
    "Install the required Python packages:\n",
    "\n",
    "```bash\n",
    "pip install \"ray[serve,llm]>=2.45.0\"\n",
    "```\n",
    "\n",
    "**Note:**\n",
    "If you are on the Anyscale platform, you can use the Docker image `anyscale/ray-llm:2.45.0-py311-cu124`.\n",
    "\n",
    "Alternatively, you can build your own Docker image on Anyscale, which can shorten workspace startup and worker node load times.\n",
    "\n",
    "A Dockerfile is included in the workspace.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setting Up Your LLM Deployment"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 1: Configure the Deployment and Worker Node\n",
    "Create an `LLMConfig` object that sets up your model’s runtime environment, loading parameters, and engine options.  \n",
    "\n",
    "This deployment sets `accelerator_type='L4'` so that replicas are scheduled on nodes with L4 GPUs.\n",
    "\n",
    "Because the model is large, we use tensor parallelism by setting `'tensor_parallel_size': 4` to distribute the load across 4 GPUs.\n",
    "\n",
    "**Load gated Hugging Face models:**\n",
    "\n",
    "Qwen models do not require a Hugging Face token, but some models (such as Llama 3.1 models) may require registration and access. To use gated Hugging Face models, see: https://docs.ray.io/en/latest/serve/llm/serving-llms.html#how-do-i-use-gated-huggingface-models\n",
    "\n",
    "**Use other vLLM engine arguments:**\n",
    "\n",
    "To pass additional engine arguments to vLLM, check out: https://docs.vllm.ai/en/latest/serving/engine_args.html"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO 05-19 09:42:05 [__init__.py:243] No platform detected, vLLM is running on UnspecifiedPlatform\n"
     ]
    }
   ],
   "source": [
    "from ray import serve\n",
    "from ray.serve.llm import LLMConfig\n",
    "\n",
    "\n",
    "llm_config = LLMConfig(\n",
    "    model_loading_config={\n",
    "        'model_id': 'Qwen/Qwen2.5-32B-Instruct'\n",
    "    },\n",
    "    engine_kwargs={\n",
    "        'max_num_batched_tokens': 8192,\n",
    "        'max_model_len': 8192,\n",
    "        'max_num_seqs': 64,\n",
    "        'tensor_parallel_size': 4,\n",
    "        'trust_remote_code': True,\n",
    "    },\n",
    "    accelerator_type='L4',\n",
    "    deployment_config={\n",
    "        'autoscaling_config': {\n",
    "            'target_ongoing_requests': 32\n",
    "        },\n",
    "        'max_ongoing_requests': 64,\n",
    "    },\n",
    ")\n",
    "\n"
   ]
  },
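  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough sanity check on why `'tensor_parallel_size': 4` suits this model, the sketch below estimates the weight memory per GPU. The parameter count (about 32.5 billion) and the 24 GB L4 capacity are assumptions taken from public model and hardware specifications, not values read from this notebook, and the estimate ignores the KV cache and activations.\n",
    "\n",
    "```python\n",
    "# Illustrative estimate only; the constants below are assumptions.\n",
    "PARAM_COUNT = 32.5e9        # approximate parameter count of Qwen2.5-32B\n",
    "BYTES_PER_PARAM = 2         # bfloat16 weights\n",
    "TENSOR_PARALLEL_SIZE = 4    # matches engine_kwargs above\n",
    "GPU_MEMORY_GB = 24          # nominal L4 memory; usable memory is lower\n",
    "\n",
    "\n",
    "def weights_per_gpu_gib(param_count, bytes_per_param, tp_size):\n",
    "    '''Approximate weight memory per GPU when weights are sharded evenly.'''\n",
    "    total_gib = param_count * bytes_per_param / 2**30\n",
    "    return total_gib / tp_size\n",
    "\n",
    "\n",
    "per_gpu = weights_per_gpu_gib(PARAM_COUNT, BYTES_PER_PARAM, TENSOR_PARALLEL_SIZE)\n",
    "print(f'~{per_gpu:.1f} GiB of weights per GPU, out of {GPU_MEMORY_GB} GB nominal')\n",
    "```\n",
    "\n",
    "The result, roughly 15 GiB per GPU, is in line with the `model weights take 15.39GiB` line in the engine log recorded later in this notebook, and it leaves only a few GiB per GPU for the KV cache.\n"
   ]
  },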
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Tips for Faster Model Downloading / Loading:\n",
    "\n",
    "1. **For faster model downloading**, set the `HF_HUB_ENABLE_HF_TRANSFER` environment variable to `1` and install the helper package with `pip install hf_transfer`. Check out: https://docs.ray.io/en/latest/serve/llm/serving-llms.html#why-is-downloading-the-model-so-slow\n",
    "\n",
    "2. **For faster model loading**, download the model weights to Anyscale’s cluster storage, which improves cluster startup times and scaling efficiency. For example, the Qwen model files (around 60 GB in size) can be stored at: `/mnt/cluster_storage/Qwen/Qwen2.5-32B-Instruct`\n",
    "\n"
   ]
  },
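  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The two tips above can be combined in a small helper that prepares the relevant `LLMConfig` arguments. This is a sketch under assumptions: `build_loading_options` is a hypothetical helper, and it assumes that `model_loading_config` accepts a `model_source` entry pointing at a local directory and that `LLMConfig` accepts a `runtime_env` with `env_vars`. Verify both against the Ray Serve LLM documentation for your Ray version.\n",
    "\n",
    "```python\n",
    "import os\n",
    "\n",
    "MODEL_ID = 'Qwen/Qwen2.5-32B-Instruct'\n",
    "# Pre-downloaded copy of the weights on cluster storage (path from tip 2).\n",
    "LOCAL_WEIGHTS = '/mnt/cluster_storage/Qwen/Qwen2.5-32B-Instruct'\n",
    "\n",
    "\n",
    "def build_loading_options(model_id, local_weights):\n",
    "    '''Hypothetical helper: return (model_loading_config, runtime_env) dicts.'''\n",
    "    model_loading_config = {'model_id': model_id}\n",
    "    # Point at the local copy only if it is actually present;\n",
    "    # otherwise the weights are fetched by model ID.\n",
    "    if os.path.isdir(local_weights):\n",
    "        model_loading_config['model_source'] = local_weights\n",
    "    # hf_transfer must be installed for this variable to take effect.\n",
    "    runtime_env = {'env_vars': {'HF_HUB_ENABLE_HF_TRANSFER': '1'}}\n",
    "    return model_loading_config, runtime_env\n",
    "\n",
    "\n",
    "model_loading_config, runtime_env = build_loading_options(MODEL_ID, LOCAL_WEIGHTS)\n",
    "print(model_loading_config)\n",
    "print(runtime_env)\n",
    "```\n",
    "\n",
    "The returned dictionaries can then be passed as `LLMConfig(model_loading_config=..., runtime_env=..., ...)` alongside the other arguments from Step 1.\n"
   ]
  },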
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 2: Start Ray Serve and Deploy Your Model\n",
    "\n",
    "You can use `build_openai_app` directly to build an OpenAI API-compatible app:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2025-05-19 09:42:09,192\tINFO worker.py:1694 -- Connecting to existing Ray cluster at address: 10.0.26.188:6379...\n",
      "2025-05-19 09:42:09,202\tINFO worker.py:1879 -- Connected to Ray cluster. View the dashboard at \u001b[1m\u001b[32mhttps://session-xl5p5c8v2puhejgj5rjjn1g6ht.i.anyscaleuserdata.com \u001b[39m\u001b[22m\n",
      "2025-05-19 09:42:09,210\tINFO packaging.py:367 -- Pushing file package 'gcs://_ray_pkg_64fd167031b33f561d300e31010ccea98347bd4a.zip' (3.45MiB) to Ray cluster...\n",
      "2025-05-19 09:42:09,225\tINFO packaging.py:380 -- Successfully pushed file package 'gcs://_ray_pkg_64fd167031b33f561d300e31010ccea98347bd4a.zip'.\n",
      "\u001b[36m(ProxyActor pid=12151)\u001b[0m INFO 2025-05-19 09:42:12,727 proxy 10.0.26.188 -- Proxy starting on node 7a87cdeb8936fafd92d0d4cab8456af74f2aae665f59cec80664527f (HTTP port: 8000).\n",
      "INFO 2025-05-19 09:42:12,805 serve 8988 -- Started Serve in namespace \"serve\".\n",
      "\u001b[36m(ServeController pid=12094)\u001b[0m INFO 2025-05-19 09:42:12,834 controller 12094 -- Deploying new version of Deployment(name='LLMDeployment:Qwen--Qwen2_5-32B-Instruct', app='default') (initial target replicas: 1).\n",
      "\u001b[36m(ServeController pid=12094)\u001b[0m INFO 2025-05-19 09:42:12,835 controller 12094 -- Deploying new version of Deployment(name='LLMRouter', app='default') (initial target replicas: 2).\n",
      "\u001b[36m(ProxyActor pid=12151)\u001b[0m INFO 2025-05-19 09:42:12,772 proxy 10.0.26.188 -- Got updated endpoints: {}.\n",
      "\u001b[36m(ProxyActor pid=12151)\u001b[0m INFO 2025-05-19 09:42:12,837 proxy 10.0.26.188 -- Got updated endpoints: {Deployment(name='LLMRouter', app='default'): EndpointInfo(route='/', app_is_cross_language=False)}.\n",
      "\u001b[36m(ServeController pid=12094)\u001b[0m INFO 2025-05-19 09:42:12,938 controller 12094 -- Adding 1 replica to Deployment(name='LLMDeployment:Qwen--Qwen2_5-32B-Instruct', app='default').\n",
      "\u001b[36m(ServeController pid=12094)\u001b[0m INFO 2025-05-19 09:42:12,940 controller 12094 -- Adding 2 replicas to Deployment(name='LLMRouter', app='default').\n",
      "\u001b[36m(ProxyActor pid=12151)\u001b[0m INFO 2025-05-19 09:42:12,854 proxy 10.0.26.188 -- Started <ray.serve._private.router.SharedRouterLongPollClient object at 0x75df38c78890>.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[36m(autoscaler +20s)\u001b[0m Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.\n",
      "\u001b[36m(autoscaler +20s)\u001b[0m [autoscaler] [4xA10G:48CPU-192GB] Attempting to add 1 node(s) to the cluster (increasing from 0 to 1).\n",
      "\u001b[36m(autoscaler +25s)\u001b[0m [autoscaler] [4xA10G:48CPU-192GB] Launched 1 instances.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[36m(ServeController pid=12094)\u001b[0m WARNING 2025-05-19 09:42:43,001 controller 12094 -- Deployment 'LLMDeployment:Qwen--Qwen2_5-32B-Instruct' in application 'default' has 1 replicas that have taken more than 30s to be scheduled. This may be due to waiting for the cluster to auto-scale or for a runtime environment to be installed. Resources required for each replica: [{\"CPU\": 1.0}, {\"GPU\": 1.0, \"accelerator_type:A10G\": 0.001}, {\"GPU\": 1.0, \"accelerator_type:A10G\": 0.001}, {\"GPU\": 1.0, \"accelerator_type:A10G\": 0.001}, {\"GPU\": 1.0, \"accelerator_type:A10G\": 0.001}], total resources available: {}. Use `ray status` for more details.\n",
      "\u001b[36m(ServeController pid=12094)\u001b[0m WARNING 2025-05-19 09:42:43,002 controller 12094 -- Deployment 'LLMRouter' in application 'default' has 2 replicas that have taken more than 30s to be scheduled. This may be due to waiting for the cluster to auto-scale or for a runtime environment to be installed. Resources required for each replica: {\"CPU\": 1}, total resources available: {}. Use `ray status` for more details.\n",
      "\u001b[36m(ServeController pid=12094)\u001b[0m WARNING 2025-05-19 09:43:13,011 controller 12094 -- Deployment 'LLMDeployment:Qwen--Qwen2_5-32B-Instruct' in application 'default' has 1 replicas that have taken more than 30s to be scheduled. This may be due to waiting for the cluster to auto-scale or for a runtime environment to be installed. Resources required for each replica: [{\"CPU\": 1.0}, {\"GPU\": 1.0, \"accelerator_type:A10G\": 0.001}, {\"GPU\": 1.0, \"accelerator_type:A10G\": 0.001}, {\"GPU\": 1.0, \"accelerator_type:A10G\": 0.001}, {\"GPU\": 1.0, \"accelerator_type:A10G\": 0.001}], total resources available: {}. Use `ray status` for more details.\n",
      "\u001b[36m(ServeController pid=12094)\u001b[0m WARNING 2025-05-19 09:43:13,012 controller 12094 -- Deployment 'LLMRouter' in application 'default' has 2 replicas that have taken more than 30s to be scheduled. This may be due to waiting for the cluster to auto-scale or for a runtime environment to be installed. Resources required for each replica: {\"CPU\": 1}, total resources available: {\"CPU\": 45.0}. Use `ray status` for more details.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[36m(ServeReplica:default:LLMRouter pid=3173, ip=10.0.2.213)\u001b[0m INFO 05-19 09:43:18 [__init__.py:239] Automatically detected platform cuda.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[36m(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213)\u001b[0m WARNING 2025-05-19 09:43:18,926 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z -- VLLM_USE_V1 environment variable is not set, using vLLM v0 as default. Later we may switch default to use v1 once vLLM v1 is mature.\n",
      "\u001b[36m(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213)\u001b[0m No cloud storage mirror configured\n",
      "\u001b[36m(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213)\u001b[0m INFO 2025-05-19 09:43:18,940 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z -- Downloading the tokenizer for Qwen/Qwen2.5-32B-Instruct\n",
      "\u001b[36m(ProxyActor pid=3181, ip=10.0.2.213)\u001b[0m INFO 2025-05-19 09:43:19,673 proxy 10.0.2.213 -- Proxy starting on node bfba2b6c78ebe78a0517c4e46aa9a7d4229b0b17a65d9fadabc26c3e (HTTP port: 8000).\n",
      "\u001b[36m(ProxyActor pid=3181, ip=10.0.2.213)\u001b[0m INFO 2025-05-19 09:43:19,722 proxy 10.0.2.213 -- Got updated endpoints: {Deployment(name='LLMRouter', app='default'): EndpointInfo(route='/', app_is_cross_language=False)}.\n",
      "\u001b[36m(ProxyActor pid=3181, ip=10.0.2.213)\u001b[0m INFO 2025-05-19 09:43:19,733 proxy 10.0.2.213 -- Started <ray.serve._private.router.SharedRouterLongPollClient object at 0x71013493c110>.\n",
      "\u001b[36m(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213)\u001b[0m You are using a model of type qwen2 to instantiate a model of type . This is not supported for all configurations of models and can yield errors.\n",
      "\u001b[36m(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213)\u001b[0m You are using a model of type qwen2 to instantiate a model of type . This is not supported for all configurations of models and can yield errors.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[36m(pid=4692, ip=10.0.2.213)\u001b[0m INFO 05-19 09:43:27 [__init__.py:239] Automatically detected platform cuda.\u001b[32m [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)\u001b[0m\n",
      "\u001b[36m(_get_vllm_engine_config pid=4692, ip=10.0.2.213)\u001b[0m INFO 05-19 09:43:36 [config.py:585] This model supports multiple tasks: {'score', 'classify', 'reward', 'generate', 'embed'}. Defaulting to 'generate'.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[36m(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213)\u001b[0m INFO 2025-05-19 09:43:36,631 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z -- [STATUS] Getting the server ready ...\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[36m(pid=4800, ip=10.0.2.213)\u001b[0m INFO 05-19 09:43:41 [__init__.py:239] Automatically detected platform cuda.\n",
      "\u001b[36m(_EngineBackgroundProcess pid=4800, ip=10.0.2.213)\u001b[0m INFO 05-19 09:43:41 [llm_engine.py:241] Initializing a V0 LLM engine (v0.8.2) with config: model='Qwen/Qwen2.5-32B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-32B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=Qwen/Qwen2.5-32B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={\"splitting_ops\":[],\"compile_sizes\":[],\"cudagraph_capture_sizes\":[64,56,48,40,32,24,16,8,4,2,1],\"max_capture_size\":64}, use_cached_outputs=True, \n",
      "\u001b[36m(_EngineBackgroundProcess pid=4800, ip=10.0.2.213)\u001b[0m INFO 05-19 09:43:42 [ray_utils.py:288] Ray is already initialized. Skipping Ray initialization.\n",
      "\u001b[36m(_EngineBackgroundProcess pid=4800, ip=10.0.2.213)\u001b[0m INFO 05-19 09:43:42 [ray_utils.py:314] Using the existing placement group\n",
      "\u001b[36m(_EngineBackgroundProcess pid=4800, ip=10.0.2.213)\u001b[0m INFO 05-19 09:43:42 [ray_distributed_executor.py:176] use_ray_spmd_worker: False\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[36m(ServeController pid=12094)\u001b[0m WARNING 2025-05-19 09:43:43,110 controller 12094 -- Deployment 'LLMDeployment:Qwen--Qwen2_5-32B-Instruct' in application 'default' has 1 replicas that have taken more than 30s to initialize.\n",
      "\u001b[36m(ServeController pid=12094)\u001b[0m This may be caused by a slow __init__ or reconfigure method.\n",
      "\u001b[36m(ServeController pid=12094)\u001b[0m WARNING 2025-05-19 09:43:43,111 controller 12094 -- Deployment 'LLMRouter' in application 'default' has 2 replicas that have taken more than 30s to initialize.\n",
      "\u001b[36m(ServeController pid=12094)\u001b[0m This may be caused by a slow __init__ or reconfigure method.\n",
      "\u001b[36m(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213)\u001b[0m INFO 2025-05-19 09:43:46,679 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z -- [STATUS] Waiting for engine process ...\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[36m(pid=4895, ip=10.0.2.213)\u001b[0m INFO 05-19 09:43:47 [__init__.py:239] Automatically detected platform cuda.\n",
      "\u001b[36m(pid=4894, ip=10.0.2.213)\u001b[0m INFO 05-19 09:43:47 [__init__.py:239] Automatically detected platform cuda.\n",
      "\u001b[36m(_EngineBackgroundProcess pid=4800, ip=10.0.2.213)\u001b[0m INFO 05-19 09:43:48 [ray_distributed_executor.py:352] non_carry_over_env_vars from config: set()\n",
      "\u001b[36m(_EngineBackgroundProcess pid=4800, ip=10.0.2.213)\u001b[0m INFO 05-19 09:43:48 [ray_distributed_executor.py:354] Copying the following environment variables to workers: ['LD_LIBRARY_PATH', 'VLLM_USE_V1']\n",
      "\u001b[36m(_EngineBackgroundProcess pid=4800, ip=10.0.2.213)\u001b[0m INFO 05-19 09:43:48 [ray_distributed_executor.py:357] If certain env vars should NOT be copied to workers, add them to /home/ray/.config/vllm/ray_non_carry_over_env_vars.json file\n",
      "\u001b[36m(_EngineBackgroundProcess pid=4800, ip=10.0.2.213)\u001b[0m INFO 05-19 09:43:49 [cuda.py:291] Using Flash Attention backend.\n",
      "\u001b[36m(_EngineBackgroundProcess pid=4800, ip=10.0.2.213)\u001b[0m INFO 05-19 09:43:51 [utils.py:931] Found nccl from library libnccl.so.2\n",
      "\u001b[36m(_EngineBackgroundProcess pid=4800, ip=10.0.2.213)\u001b[0m INFO 05-19 09:43:51 [pynccl.py:69] vLLM is using nccl==2.21.5\n",
      "\u001b[36m(_EngineBackgroundProcess pid=4800, ip=10.0.2.213)\u001b[0m WARNING 05-19 09:43:51 [custom_all_reduce.py:137] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.\n",
      "\u001b[36m(_EngineBackgroundProcess pid=4800, ip=10.0.2.213)\u001b[0m INFO 05-19 09:43:51 [shm_broadcast.py:259] vLLM message queue communication handle: Handle(local_reader_ranks=[1, 2, 3], buffer_handle=(3, 4194304, 6, 'psm_be83b251'), local_subscribe_addr='ipc:///tmp/480e4aeb-e87f-414e-a20c-f85308fc985a', remote_subscribe_addr=None, remote_addr_ipv6=False)\n",
      "\u001b[36m(_EngineBackgroundProcess pid=4800, ip=10.0.2.213)\u001b[0m INFO 05-19 09:43:51 [parallel_state.py:954] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 0\n",
      "\u001b[36m(_EngineBackgroundProcess pid=4800, ip=10.0.2.213)\u001b[0m INFO 05-19 09:43:51 [model_runner.py:1110] Starting to load model Qwen/Qwen2.5-32B-Instruct...\n",
      "\u001b[36m(_EngineBackgroundProcess pid=4800, ip=10.0.2.213)\u001b[0m INFO 05-19 09:43:52 [weight_utils.py:265] Using model weights format ['*.safetensors']\n",
      "\u001b[36m(pid=4893, ip=10.0.2.213)\u001b[0m INFO 05-19 09:43:47 [__init__.py:239] Automatically detected platform cuda.\u001b[32m [repeated 2x across cluster]\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[36m(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213)\u001b[0m INFO 2025-05-19 09:43:57,728 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z -- [STATUS] Waiting for engine process ...\n",
      "\u001b[36m(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213)\u001b[0m INFO 2025-05-19 09:44:08,780 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z -- [STATUS] Waiting for engine process ...\n",
      "\u001b[36m(ServeController pid=12094)\u001b[0m WARNING 2025-05-19 09:44:13,140 controller 12094 -- Deployment 'LLMDeployment:Qwen--Qwen2_5-32B-Instruct' in application 'default' has 1 replicas that have taken more than 30s to initialize.\n",
      "\u001b[36m(ServeController pid=12094)\u001b[0m This may be caused by a slow __init__ or reconfigure method.\n",
      "\u001b[36m(ServeController pid=12094)\u001b[0m WARNING 2025-05-19 09:44:13,141 controller 12094 -- Deployment 'LLMRouter' in application 'default' has 2 replicas that have taken more than 30s to initialize.\n",
      "\u001b[36m(ServeController pid=12094)\u001b[0m This may be caused by a slow __init__ or reconfigure method.\n",
      "\u001b[36m(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213)\u001b[0m INFO 2025-05-19 09:44:19,829 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z -- [STATUS] Waiting for engine process ...\n",
      "\u001b[36m(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213)\u001b[0m INFO 2025-05-19 09:44:30,880 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z -- [STATUS] Waiting for engine process ...\n",
      "\u001b[36m(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213)\u001b[0m INFO 2025-05-19 09:44:41,928 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z -- [STATUS] Waiting for engine process ...\n",
      "\u001b[36m(ServeController pid=12094)\u001b[0m WARNING 2025-05-19 09:44:43,166 controller 12094 -- Deployment 'LLMDeployment:Qwen--Qwen2_5-32B-Instruct' in application 'default' has 1 replicas that have taken more than 30s to initialize.\n",
      "\u001b[36m(ServeController pid=12094)\u001b[0m This may be caused by a slow __init__ or reconfigure method.\n",
      "\u001b[36m(ServeController pid=12094)\u001b[0m WARNING 2025-05-19 09:44:43,166 controller 12094 -- Deployment 'LLMRouter' in application 'default' has 2 replicas that have taken more than 30s to initialize.\n",
      "\u001b[36m(ServeController pid=12094)\u001b[0m This may be caused by a slow __init__ or reconfigure method.\n",
      "\u001b[36m(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213)\u001b[0m INFO 2025-05-19 09:44:52,976 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z -- [STATUS] Waiting for engine process ...\n",
      "\u001b[36m(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213)\u001b[0m INFO 2025-05-19 09:45:04,024 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z -- [STATUS] Waiting for engine process ...\n",
      "\u001b[36m(ServeController pid=12094)\u001b[0m WARNING 2025-05-19 09:45:13,192 controller 12094 -- Deployment 'LLMDeployment:Qwen--Qwen2_5-32B-Instruct' in application 'default' has 1 replicas that have taken more than 30s to initialize.\n",
      "\u001b[36m(ServeController pid=12094)\u001b[0m This may be caused by a slow __init__ or reconfigure method.\n",
      "\u001b[36m(ServeController pid=12094)\u001b[0m WARNING 2025-05-19 09:45:13,193 controller 12094 -- Deployment 'LLMRouter' in application 'default' has 2 replicas that have taken more than 30s to initialize.\n",
      "\u001b[36m(ServeController pid=12094)\u001b[0m This may be caused by a slow __init__ or reconfigure method.\n",
      "\u001b[36m(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213)\u001b[0m INFO 2025-05-19 09:45:15,073 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z -- [STATUS] Waiting for engine process ...\n",
      "\u001b[36m(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213)\u001b[0m INFO 2025-05-19 09:45:26,119 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z -- [STATUS] Waiting for engine process ...\n",
      "\u001b[36m(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213)\u001b[0m INFO 2025-05-19 09:45:37,134 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z -- [STATUS] Waiting for engine process ...\n",
      "\u001b[36m(ServeController pid=12094)\u001b[0m WARNING 2025-05-19 09:45:43,220 controller 12094 -- Deployment 'LLMDeployment:Qwen--Qwen2_5-32B-Instruct' in application 'default' has 1 replicas that have taken more than 30s to initialize.\n",
      "\u001b[36m(ServeController pid=12094)\u001b[0m This may be caused by a slow __init__ or reconfigure method.\n",
      "\u001b[36m(ServeController pid=12094)\u001b[0m WARNING 2025-05-19 09:45:43,220 controller 12094 -- Deployment 'LLMRouter' in application 'default' has 2 replicas that have taken more than 30s to initialize.\n",
      "\u001b[36m(ServeController pid=12094)\u001b[0m This may be caused by a slow __init__ or reconfigure method.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[36m(RayWorkerWrapper pid=4895, ip=10.0.2.213)\u001b[0m INFO 05-19 09:45:47 [weight_utils.py:281] Time spent downloading weights for Qwen/Qwen2.5-32B-Instruct: 114.670886 seconds\n",
      "\u001b[36m(RayWorkerWrapper pid=4893, ip=10.0.2.213)\u001b[0m INFO 05-19 09:43:49 [cuda.py:291] Using Flash Attention backend.\u001b[32m [repeated 3x across cluster]\u001b[0m\n",
      "\u001b[36m(RayWorkerWrapper pid=4893, ip=10.0.2.213)\u001b[0m INFO 05-19 09:43:51 [utils.py:931] Found nccl from library libnccl.so.2\u001b[32m [repeated 3x across cluster]\u001b[0m\n",
      "\u001b[36m(RayWorkerWrapper pid=4893, ip=10.0.2.213)\u001b[0m INFO 05-19 09:43:51 [pynccl.py:69] vLLM is using nccl==2.21.5\u001b[32m [repeated 3x across cluster]\u001b[0m\n",
      "\u001b[36m(RayWorkerWrapper pid=4893, ip=10.0.2.213)\u001b[0m WARNING 05-19 09:43:51 [custom_all_reduce.py:137] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.\u001b[32m [repeated 3x across cluster]\u001b[0m\n",
      "\u001b[36m(RayWorkerWrapper pid=4893, ip=10.0.2.213)\u001b[0m INFO 05-19 09:43:51 [parallel_state.py:954] rank 3 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 3\u001b[32m [repeated 3x across cluster]\u001b[0m\n",
      "\u001b[36m(RayWorkerWrapper pid=4893, ip=10.0.2.213)\u001b[0m INFO 05-19 09:43:51 [model_runner.py:1110] Starting to load model Qwen/Qwen2.5-32B-Instruct...\u001b[32m [repeated 3x across cluster]\u001b[0m\n",
      "\u001b[36m(RayWorkerWrapper pid=4893, ip=10.0.2.213)\u001b[0m INFO 05-19 09:43:52 [weight_utils.py:265] Using model weights format ['*.safetensors']\u001b[32m [repeated 3x across cluster]\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Loading safetensors checkpoint shards:   0% Completed | 0/17 [00:00<?, ?it/s]\n",
      "\u001b[36m(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213)\u001b[0m INFO 2025-05-19 09:45:48,144 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z -- [STATUS] Waiting for engine process ...\n",
      "Loading safetensors checkpoint shards:   6% Completed | 1/17 [00:00<00:06,  2.50it/s]\n",
      "Loading safetensors checkpoint shards:  12% Completed | 2/17 [00:00<00:06,  2.27it/s]\n",
      "Loading safetensors checkpoint shards:  18% Completed | 3/17 [00:01<00:06,  2.22it/s]\n",
      "Loading safetensors checkpoint shards:  24% Completed | 4/17 [00:01<00:05,  2.35it/s]\n",
      "Loading safetensors checkpoint shards:  29% Completed | 5/17 [00:02<00:05,  2.31it/s]\n",
      "Loading safetensors checkpoint shards:  35% Completed | 6/17 [00:02<00:04,  2.25it/s]\n",
      "Loading safetensors checkpoint shards:  41% Completed | 7/17 [00:03<00:04,  2.22it/s]\n",
      "Loading safetensors checkpoint shards:  47% Completed | 8/17 [00:03<00:04,  2.20it/s]\n",
      "Loading safetensors checkpoint shards:  53% Completed | 9/17 [00:03<00:03,  2.43it/s]\n",
      "Loading safetensors checkpoint shards:  59% Completed | 10/17 [00:04<00:02,  2.40it/s]\n",
      "Loading safetensors checkpoint shards:  65% Completed | 11/17 [00:04<00:02,  2.32it/s]\n",
      "Loading safetensors checkpoint shards:  71% Completed | 12/17 [00:05<00:02,  2.27it/s]\n",
      "Loading safetensors checkpoint shards:  76% Completed | 13/17 [00:05<00:01,  2.23it/s]\n",
      "Loading safetensors checkpoint shards:  82% Completed | 14/17 [00:06<00:01,  2.21it/s]\n",
      "Loading safetensors checkpoint shards:  88% Completed | 15/17 [00:06<00:00,  2.20it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[36m(RayWorkerWrapper pid=4895, ip=10.0.2.213)\u001b[0m INFO 05-19 09:45:55 [loader.py:447] Loading weights took 7.66 seconds\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Loading safetensors checkpoint shards:  94% Completed | 16/17 [00:07<00:00,  2.20it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[36m(RayWorkerWrapper pid=4895, ip=10.0.2.213)\u001b[0m INFO 05-19 09:45:55 [model_runner.py:1146] Model loading took 15.3918 GB and 123.242199 seconds\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Loading safetensors checkpoint shards: 100% Completed | 17/17 [00:07<00:00,  2.21it/s]\n",
      "Loading safetensors checkpoint shards: 100% Completed | 17/17 [00:07<00:00,  2.26it/s]\n",
      "\u001b[36m(_EngineBackgroundProcess pid=4800, ip=10.0.2.213)\u001b[0m \n",
      "\u001b[36m(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213)\u001b[0m INFO 2025-05-19 09:45:59,195 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z -- [STATUS] Waiting for engine process ...\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[36m(RayWorkerWrapper pid=4895, ip=10.0.2.213)\u001b[0m INFO 05-19 09:46:06 [worker.py:267] Memory profiling takes 10.85 seconds\n",
      "\u001b[36m(RayWorkerWrapper pid=4895, ip=10.0.2.213)\u001b[0m INFO 05-19 09:46:06 [worker.py:267] the current vLLM instance can use total_gpu_memory (21.98GiB) x gpu_memory_utilization (0.90) = 19.78GiB\n",
      "\u001b[36m(RayWorkerWrapper pid=4895, ip=10.0.2.213)\u001b[0m INFO 05-19 09:46:06 [worker.py:267] model weights take 15.39GiB; non_torch_memory takes 0.21GiB; PyTorch activation peak memory takes 0.72GiB; the rest of the memory reserved for KV Cache is 3.47GiB.\n",
      "\u001b[36m(_EngineBackgroundProcess pid=4800, ip=10.0.2.213)\u001b[0m INFO 05-19 09:45:55 [loader.py:447] Loading weights took 7.61 seconds\u001b[32m [repeated 3x across cluster]\u001b[0m\n",
      "\u001b[36m(_EngineBackgroundProcess pid=4800, ip=10.0.2.213)\u001b[0m INFO 05-19 09:45:55 [model_runner.py:1146] Model loading took 15.3918 GB and 123.829000 seconds\u001b[32m [repeated 3x across cluster]\u001b[0m\n",
      "\u001b[36m(_EngineBackgroundProcess pid=4800, ip=10.0.2.213)\u001b[0m INFO 05-19 09:46:07 [executor_base.py:111] # cuda blocks: 3549, # CPU blocks: 4096\n",
      "\u001b[36m(_EngineBackgroundProcess pid=4800, ip=10.0.2.213)\u001b[0m INFO 05-19 09:46:07 [executor_base.py:116] Maximum concurrency for 8192 tokens per request: 6.93x\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[36m(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213)\u001b[0m INFO 2025-05-19 09:46:10,246 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z -- [STATUS] Waiting for engine process ...\n",
      "Capturing CUDA graph shapes:   0%|          | 0/11 [00:00<?, ?it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[36m(_EngineBackgroundProcess pid=4800, ip=10.0.2.213)\u001b[0m INFO 05-19 09:46:11 [model_runner.py:1442] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Capturing CUDA graph shapes:   9%|▉         | 1/11 [00:00<00:07,  1.43it/s]\n",
      "Capturing CUDA graph shapes:  18%|█▊        | 2/11 [00:01<00:06,  1.50it/s]\n",
      "Capturing CUDA graph shapes:  27%|██▋       | 3/11 [00:01<00:05,  1.53it/s]\n",
      "\u001b[36m(ServeController pid=12094)\u001b[0m WARNING 2025-05-19 09:46:13,234 controller 12094 -- Deployment 'LLMDeployment:Qwen--Qwen2_5-32B-Instruct' in application 'default' has 1 replicas that have taken more than 30s to initialize.\n",
      "\u001b[36m(ServeController pid=12094)\u001b[0m This may be caused by a slow __init__ or reconfigure method.\n",
      "\u001b[36m(ServeController pid=12094)\u001b[0m WARNING 2025-05-19 09:46:13,235 controller 12094 -- Deployment 'LLMRouter' in application 'default' has 2 replicas that have taken more than 30s to initialize.\n",
      "\u001b[36m(ServeController pid=12094)\u001b[0m This may be caused by a slow __init__ or reconfigure method.\n",
      "Capturing CUDA graph shapes:  36%|███▋      | 4/11 [00:02<00:04,  1.56it/s]\n",
      "Capturing CUDA graph shapes:  45%|████▌     | 5/11 [00:03<00:03,  1.61it/s]\n",
      "Capturing CUDA graph shapes:  55%|█████▍    | 6/11 [00:03<00:03,  1.65it/s]\n",
      "Capturing CUDA graph shapes:  64%|██████▎   | 7/11 [00:04<00:02,  1.69it/s]\n",
      "Capturing CUDA graph shapes:  73%|███████▎  | 8/11 [00:04<00:01,  1.73it/s]\n",
      "Capturing CUDA graph shapes:  82%|████████▏ | 9/11 [00:05<00:01,  1.76it/s]\n",
      "Capturing CUDA graph shapes:  91%|█████████ | 10/11 [00:05<00:00,  1.78it/s]\n",
      "Capturing CUDA graph shapes: 100%|██████████| 11/11 [00:07<00:00,  1.53it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[36m(_EngineBackgroundProcess pid=4800, ip=10.0.2.213)\u001b[0m INFO 05-19 09:46:18 [model_runner.py:1570] Graph capturing finished in 7 secs, took 0.27 GiB\n",
      "\u001b[36m(_EngineBackgroundProcess pid=4800, ip=10.0.2.213)\u001b[0m INFO 05-19 09:46:18 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 22.35 seconds\n",
      "\u001b[36m(_EngineBackgroundProcess pid=4800, ip=10.0.2.213)\u001b[0m INFO 05-19 09:46:07 [worker.py:267] Memory profiling takes 11.01 seconds\u001b[32m [repeated 3x across cluster]\u001b[0m\n",
      "\u001b[36m(_EngineBackgroundProcess pid=4800, ip=10.0.2.213)\u001b[0m INFO 05-19 09:46:07 [worker.py:267] the current vLLM instance can use total_gpu_memory (21.98GiB) x gpu_memory_utilization (0.90) = 19.78GiB\u001b[32m [repeated 3x across cluster]\u001b[0m\n",
      "\u001b[36m(_EngineBackgroundProcess pid=4800, ip=10.0.2.213)\u001b[0m INFO 05-19 09:46:07 [worker.py:267] model weights take 15.39GiB; non_torch_memory takes 0.21GiB; PyTorch activation peak memory takes 0.72GiB; the rest of the memory reserved for KV Cache is 3.47GiB.\u001b[32m [repeated 3x across cluster]\u001b[0m\n",
      "\u001b[36m(RayWorkerWrapper pid=4893, ip=10.0.2.213)\u001b[0m INFO 05-19 09:46:11 [model_runner.py:1442] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.\u001b[32m [repeated 3x across cluster]\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[36m(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213)\u001b[0m INFO 2025-05-19 09:46:18,377 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z -- [STATUS] Server is ready.\n",
      "\u001b[36m(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213)\u001b[0m INFO 2025-05-19 09:46:18,377 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z -- Started vLLM engine.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[36m(pid=13590)\u001b[0m INFO 05-19 09:46:24 [__init__.py:243] No platform detected, vLLM is running on UnspecifiedPlatform\n",
      "\u001b[36m(RayWorkerWrapper pid=4893, ip=10.0.2.213)\u001b[0m INFO 05-19 09:46:18 [model_runner.py:1570] Graph capturing finished in 7 secs, took 0.27 GiB\u001b[32m [repeated 3x across cluster]\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[36m(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213)\u001b[0m INFO 2025-05-19 09:46:25,011 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z d1bdf7ac-880f-4133-aebe-2c73e30cf68c -- CALL llm_config OK 192.3ms\n",
      "\u001b[36m(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213)\u001b[0m INFO 2025-05-19 09:46:25,016 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z 1d024257-183d-4e5f-bdef-15d16a1fd9b7 -- CALL llm_config OK 196.5ms\n",
      "INFO 2025-05-19 09:46:26,426 serve 8988 -- Application 'default' is ready at http://127.0.0.1:8000/.\n",
      "INFO 2025-05-19 09:46:26,434 serve 8988 -- Started <ray.serve._private.router.SharedRouterLongPollClient object at 0x7a0b8195bb90>.\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "DeploymentHandle(deployment='LLMRouter')"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from ray.serve.llm import build_openai_app\n",
    "\n",
    "# Build and deploy the model behind an OpenAI-compatible API.\n",
    "llm_app = build_openai_app({\"llm_configs\": [llm_config]})\n",
    "serve.run(llm_app)"
   ]
  },
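  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The startup log above reports `# cuda blocks: 3549` and `Maximum concurrency for 8192 tokens per request: 6.93x`. As a rough check, that figure can be reproduced from the block count. This sketch assumes a KV-cache block size of 16 tokens, which is vLLM's usual default but isn't printed in the log; `max_concurrency` is a hypothetical helper name:\n",
    "\n",
    "```python\n",
    "def max_concurrency(num_gpu_blocks: int, block_size: int, max_model_len: int) -> float:\n",
    "    \"\"\"Estimate how many full-length requests fit in the GPU KV cache at once.\"\"\"\n",
    "    # Total tokens the KV cache can hold, divided by the tokens one request may use.\n",
    "    return num_gpu_blocks * block_size / max_model_len\n",
    "\n",
    "\n",
    "# Values from the log above; block_size=16 is an assumption.\n",
    "print(f\"{max_concurrency(3549, 16, 8192):.2f}x\")\n",
    "```\n",
    "\n",
    "With these values the result rounds to 6.93, matching the log. Requests shorter than 8192 tokens use fewer blocks, so more of them can fit in the cache at once."
   ]
  },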
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Streaming Chat Completions with OpenAI Client\n",
    "\n",
    "If the deployment succeeds, the output ends with log lines similar to the following:\n",
    "\n",
    "```\n",
    "INFO 2025-03-02 17:17:14,162 serve 61769 -- Application 'default' is ready at http://127.0.0.1:8000/.\n",
    "INFO 2025-03-02 17:17:14,162 serve 61769 -- Deployed app 'default' successfully.\n",
    "```\n",
    "\n",
    "**Note**: The base URL in the following code ends with **\"v1\"** because the OpenAI client appends endpoint paths such as `/chat/completions` to the base URL, and the service exposes them under `/v1`.\n",
    "\n",
    "Next, initialize an OpenAI client with this URL and stream chat completions from the deployed model. The client requires an API key, but this local deployment doesn't need a real one, so a placeholder works.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[36m(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213)\u001b[0m INFO 2025-05-19 09:49:36,349 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z 4818c277-1144-44b4-a49d-2bfc8d64a11d -- Received streaming request 4818c277-1144-44b4-a49d-2bfc8d64a11d\n",
      "\u001b[36m(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213)\u001b[0m INFO 2025-05-19 09:49:36,359 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z 4818c277-1144-44b4-a49d-2bfc8d64a11d -- Request 4818c277-1144-44b4-a49d-2bfc8d64a11d started. Prompt: <|im_start|>system\n",
      "\u001b[36m(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213)\u001b[0m You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n",
      "\u001b[36m(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213)\u001b[0m <|im_start|>user\n",
      "\u001b[36m(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213)\u001b[0m Hello!<|im_end|>\n",
      "\u001b[36m(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213)\u001b[0m <|im_start|>assistant\n",
      "\u001b[36m(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213)\u001b[0m \n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[36m(_EngineBackgroundProcess pid=4800, ip=10.0.2.213)\u001b[0m INFO 05-19 09:49:36 [engine.py:310] Added request 4818c277-1144-44b4-a49d-2bfc8d64a11d.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "{\"asctime\": \"2025-05-19 09:49:36,767\", \"levelname\": \"INFO\", \"message\": \"HTTP Request: POST http://localhost:8000/v1/chat/completions \\\"HTTP/1.1 200 OK\\\"\", \"filename\": \"_client.py\", \"lineno\": 1025, \"job_id\": \"02000000\", \"worker_id\": \"02000000ffffffffffffffffffffffffffffffffffffffffffffffff\", \"node_id\": \"7a87cdeb8936fafd92d0d4cab8456af74f2aae665f59cec80664527f\", \"timestamp_ns\": 1747673376767196110}\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Hello!\u001b[36m(_EngineBackgroundProcess pid=4800, ip=10.0.2.213)\u001b[0m INFO 05-19 09:49:36 [metrics.py:481] Avg prompt throughput: 4.2 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.\n",
      " How can I assist you today?"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[36m(_EngineBackgroundProcess pid=4800, ip=10.0.2.213)\u001b[0m INFO 05-19 09:49:37 [engine.py:330] Aborted request 4818c277-1144-44b4-a49d-2bfc8d64a11d.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[36m(ServeReplica:default:LLMRouter pid=3173, ip=10.0.2.213)\u001b[0m INFO 2025-05-19 09:49:37,128 default_LLMRouter dqlghpps 4818c277-1144-44b4-a49d-2bfc8d64a11d -- POST /v1/chat/completions 200 800.7ms\n",
      "\u001b[36m(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213)\u001b[0m INFO 2025-05-19 09:49:37,124 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z 4818c277-1144-44b4-a49d-2bfc8d64a11d -- Request 4818c277-1144-44b4-a49d-2bfc8d64a11d finished (stop). Total time: 0.7644044499999723s, Queue time: 0.0048291683197021484s, Generation+async time: 0.7595752816802701s, Input tokens: 31, Generated tokens: 10, tokens/s: 53.97753322001628, generated tokens/s: 13.165252004882019.\n",
      "\u001b[36m(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213)\u001b[0m INFO 2025-05-19 09:49:37,125 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z 4818c277-1144-44b4-a49d-2bfc8d64a11d -- CALL /v1/chat/completions OK 777.8ms\n"
     ]
    }
   ],
   "source": [
    "from openai import OpenAI\n",
    "\n",
    "# Initialize client\n",
    "client = OpenAI(base_url=\"http://localhost:8000/v1\", api_key=\"fake-key\")\n",
    "model_id = \"Qwen/Qwen2.5-32B-Instruct\"  # Must match the model ID of your deployment\n",
    "\n",
    "# Basic chat completion with streaming\n",
    "response = client.chat.completions.create(\n",
    "    model=model_id,\n",
    "    messages=[{\"role\": \"user\", \"content\": \"Hello!\"}],\n",
    "    stream=True\n",
    ")\n",
    "\n",
    "for chunk in response:\n",
    "    if chunk.choices[0].delta.content is not None:\n",
    "        print(chunk.choices[0].delta.content, end=\"\", flush=True)"
   ]
  },
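  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The loop above prints each chunk as it arrives. If you also need the complete reply as one string (for logging or post-processing), a minimal sketch is to accumulate the deltas while iterating. `collect_stream` is a hypothetical helper name, not part of the OpenAI client or Ray Serve LLM, and it assumes each chunk exposes `choices[0].delta.content` as in the cell above:\n",
    "\n",
    "```python\n",
    "from typing import Any, Iterable\n",
    "\n",
    "\n",
    "def collect_stream(chunks: Iterable[Any]) -> str:\n",
    "    \"\"\"Join the text deltas of a streamed chat completion into one string.\"\"\"\n",
    "    parts = []\n",
    "    for chunk in chunks:\n",
    "        # Skip chunks that carry no choices at all.\n",
    "        if not chunk.choices:\n",
    "            continue\n",
    "        content = chunk.choices[0].delta.content\n",
    "        # Skip deltas without text (None or an empty string).\n",
    "        if content:\n",
    "            parts.append(content)\n",
    "    return \"\".join(parts)\n",
    "```\n",
    "\n",
    "For example, `full_text = collect_stream(response)` consumes a fresh streaming `response`. A stream can only be iterated once, so you can't print it and collect it in separate passes."
   ]
  },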
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Shut down the service\n",
    "\n",
    "To stop the service, run the following command:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[36m(ServeController pid=12094)\u001b[0m INFO 2025-05-19 09:49:43,540 controller 12094 -- Removing 1 replica from Deployment(name='LLMDeployment:Qwen--Qwen2_5-32B-Instruct', app='default').\n",
      "\u001b[36m(ServeController pid=12094)\u001b[0m INFO 2025-05-19 09:49:43,540 controller 12094 -- Removing 2 replicas from Deployment(name='LLMRouter', app='default').\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[36m(_EngineBackgroundProcess pid=4800, ip=10.0.2.213)\u001b[0m INFO 05-19 09:49:45 [ray_distributed_executor.py:127] Shutting down Ray distributed executor. If you see error log from logging.cc regarding SIGTERM received, please ignore because this is the expected termination process in Ray.\n",
      "\u001b[36m(_EngineBackgroundProcess pid=4800, ip=10.0.2.213)\u001b[0m INFO 05-19 09:49:45 [ray_distributed_executor.py:127] Shutting down Ray distributed executor. If you see error log from logging.cc regarding SIGTERM received, please ignore because this is the expected termination process in Ray.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[36m(ServeController pid=12094)\u001b[0m INFO 2025-05-19 09:49:45,560 controller 12094 -- Replica(id='a4imwh9z', deployment='LLMDeployment:Qwen--Qwen2_5-32B-Instruct', app='default') is stopped.\n",
      "\u001b[36m(ServeController pid=12094)\u001b[0m INFO 2025-05-19 09:49:45,561 controller 12094 -- Replica(id='dqlghpps', deployment='LLMRouter', app='default') is stopped.\n",
      "\u001b[36m(ServeController pid=12094)\u001b[0m INFO 2025-05-19 09:49:45,561 controller 12094 -- Replica(id='o2gy2ltw', deployment='LLMRouter', app='default') is stopped.\n",
      "\u001b[36m(_EngineBackgroundProcess pid=4800, ip=10.0.2.213)\u001b[0m [rank0]:[W519 09:49:46.985931527 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2025-05-19 09:49:46,647\tSUCC scripts.py:772 -- \u001b[32mSent shutdown request; applications will be deleted asynchronously.\u001b[39m\n",
      "\u001b[0m"
     ]
    }
   ],
   "source": [
    "!serve shutdown --yes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Production Deployment\n",
    "\n",
    "For a production-ready deployment, use Anyscale Services, which deploys the Ray Serve application to a dedicated cluster with built-in scalability, fault tolerance, and load balancing.\n",
    "\n",
    "Put the deployment code in `serve_llm.py`, then deploy the service with the following command:\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```text\n",
    "anyscale service deploy serve_llm:llm_app --name=llm-service-qwen2p5-32B\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```text\n",
    "(anyscale +1.6s) Restarting existing service 'llm-service-qwen2p5-32B'.\n",
    "(anyscale +2.3s) Using workspace runtime dependencies env vars: {'HF_TOKEN': 'HF_TOKEN'}.\n",
    "(anyscale +2.3s) Uploading local dir '.' to cloud storage.\n",
    "(anyscale +8.1s) Service 'llm-service-qwen2p5-32B' deployed (version ID: 75vs71q8).\n",
    "(anyscale +8.1s) View the service in the UI: 'https://console.anyscale.com/services/service2_ybvl7arasth81zdll29mfm1jts'\n",
    "(anyscale +8.1s) Query the service once it's running using the following curl command (add the path you want to query):\n",
    "(anyscale +8.1s) curl -H \"Authorization: Bearer v-ysnEivLuvxo3ZITC8b7SkI0jZ1taqXk_eBprAr0TY\" https://llm-service-qwen2p5-32b-xxx.xxx.x.anyscaleuserdata.com/\n",
    "(autoscaler +11m55s) [autoscaler] Downscaling node i-018a78b690239bb67 (node IP: 10.0.2.213) due to node idle termination.\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Building an LLM Client for  Chat Interactions\n",
    "\n",
    "This section introduces a custom `LLMClient` class that wraps around the OpenAI API. It supports both streaming responses (token-by-token) and full message retrieval.\n",
    "\n",
    "**Note:**\n",
    "Before proceeding, please ensure that the service is operational, as it may take a few moments for it to become fully available."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import Generator, Optional\n",
    "\n",
    "from openai import OpenAI\n",
    "\n",
    "\n",
    "class LLMClient:\n",
    "    \"\"\"Thin wrapper around the OpenAI client for a deployed chat model.\"\"\"\n",
    "\n",
    "    def __init__(self, base_url: str, api_key: Optional[str] = None, model_id: Optional[str] = None):\n",
    "        # Ensure the base_url ends with a slash and does not include '/routes'\n",
    "        if not base_url.endswith(\"/\"):\n",
    "            base_url += \"/\"\n",
    "        if \"/routes\" in base_url:\n",
    "            raise ValueError(\"base_url must be the service root URL and must not include '/routes'\")\n",
    "\n",
    "        self.model_id = model_id\n",
    "        self.client = OpenAI(\n",
    "            base_url=base_url + \"v1\",\n",
    "            api_key=api_key or \"NOT A REAL KEY\",\n",
    "        )\n",
    "\n",
    "    def get_response_streaming(\n",
    "        self,\n",
    "        prompt: str,\n",
    "        temperature: float = 0.01,\n",
    "    ) -> Generator[str, None, None]:\n",
    "        \"\"\"\n",
    "        Get a response from the model based on the provided prompt.\n",
    "        Yields the response tokens as they are streamed.\n",
    "        \"\"\"\n",
    "        chat_completions = self.client.chat.completions.create(\n",
    "            model=self.model_id,\n",
    "            messages=[{\"role\": \"user\", \"content\": prompt}],\n",
    "            temperature=temperature,\n",
    "            stream=True\n",
    "        )\n",
    "\n",
    "        for chat in chat_completions:\n",
    "            delta = chat.choices[0].delta\n",
    "            if delta.content:\n",
    "                yield delta.content\n",
    "\n",
    "    def get_response(\n",
    "        self,\n",
    "        prompt: str,\n",
    "        temperature: float = 0.01,\n",
    "    ) -> str:\n",
    "        \"\"\"\n",
    "        Get a complete response from the model based on the provided prompt.\n",
    "        \"\"\"\n",
    "        chat_response = self.client.chat.completions.create(\n",
    "            model=self.model_id,\n",
    "            messages=[{\"role\": \"user\", \"content\": prompt}],\n",
    "            temperature=temperature,\n",
    "            stream=False\n",
    "        )\n",
    "        return chat_response.choices[0].message.content\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Query the Service\n",
    "\n",
    "Deploying exposes the service at a publicly accessible URL that you can send requests to.\n",
    "Copy the base URL and the API key (the bearer token) from the output of the `anyscale service deploy` command shown earlier.\n",
    "\n",
    "Then replace the placeholder values for `base_url` and `api_key` in the following code:\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Example: Streaming Response"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```text\n",
    "# Initialize client\n",
    "model_id = \"Qwen/Qwen2.5-32B-Instruct\"  # Must match the model ID of your deployment\n",
    "base_url = \"https://llm-service-qwen2p5-32b-xxx.xxx.x.anyscaleuserdata.com\"  # Replace with your service base URL\n",
    "api_key = \"\"  # Replace with your API key\n",
    "\n",
    "\n",
    "llm_client = LLMClient(\n",
    "    base_url=base_url,\n",
    "    api_key=api_key,\n",
    "    model_id=model_id,\n",
    ")\n",
    "\n",
    "\n",
    "# --- Get the response with streaming ---\n",
    "prompt = \"what is ray?\"\n",
    "print(\"Model response (streaming):\")\n",
    "for token in llm_client.get_response_streaming(prompt, temperature=0.5):\n",
    "    print(token, end=\"\")\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```text\n",
    "Model response (streaming):\n",
    "{\"asctime\": \"2025-05-19 10:01:04,287\", \"levelname\": \"INFO\", \"message\": \"HTTP Request: POST https://llm-service-qwen2p5-32b-jgz99.cld-kvedzwag2qa8i5bj.s.anyscaleuserdata.com/v1/chat/completions \\\"HTTP/1.1 200 OK\\\"\", \"filename\": \"_client.py\", \"lineno\": 1025, \"job_id\": \"02000000\", \"worker_id\": \"02000000ffffffffffffffffffffffffffffffffffffffffffffffff\", \"node_id\": \"7a87cdeb8936fafd92d0d4cab8456af74f2aae665f59cec80664527f\", \"timestamp_ns\": 1747674064287065217}\n",
    "Ray is a high-performance distributed computing framework that was originally developed by researchers at the RISELab (formerly known as AMPLab) at the University of California, Berkeley. It is designed to make it easier to write and scale parallel and distributed applications in Python. Ray is particularly well-suited for machine learning, reinforcement learning, and other data-intensive computing tasks.\n",
    "\n",
    "Ray provides several key features:\n",
    "\n",
    "1. **Task Parallelism**: Ray allows you to define tasks that can be executed in parallel across multiple CPUs or GPUs.\n",
    "\n",
    "2. **Actor Model**: Ray supports the actor model of concurrency, which means you can create and manage stateful objects (actors) that can be distributed across multiple nodes.\n",
    "\n",
    "3. **Scalability**: Ray is designed to scale from a single machine to a large cluster, making it easy to distribute computation and data across multiple machines.\n",
    "\n",
    "4. **Integration**: Ray integrates with many popular machine learning frameworks and libraries, such as TensorFlow, PyTorch, and Scikit-learn, making it a versatile tool for data scientists and machine learning engineers.\n",
    "\n",
    "5. **Ease of Use**: Ray aims to be easy to use, with a simple API that allows you to write parallel and distributed applications without needing deep knowledge of distributed systems.\n",
    "\n",
    "Ray is used in various industries for tasks ranging from training large machine learning models to real-time data processing and simulation. It's particularly popular in the reinforcement learning community due to its support for complex, multi-agent scenarios.\n",
    "```"
   ]
  },
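  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To keep the API key out of the notebook, one option is to read both values from environment variables. This is a minimal sketch: the variable names `LLM_SERVICE_BASE_URL` and `LLM_SERVICE_API_KEY` and the helper `load_service_settings` are assumptions made for illustration, not names that Anyscale or Ray define:\n",
    "\n",
    "```python\n",
    "import os\n",
    "from typing import Tuple\n",
    "\n",
    "\n",
    "def load_service_settings() -> Tuple[str, str]:\n",
    "    \"\"\"Read the service URL and API key from hypothetical environment variables.\"\"\"\n",
    "    base_url = os.environ.get(\"LLM_SERVICE_BASE_URL\", \"\")\n",
    "    api_key = os.environ.get(\"LLM_SERVICE_API_KEY\", \"\")\n",
    "    if not base_url:\n",
    "        raise RuntimeError(\"Set LLM_SERVICE_BASE_URL to your service base URL\")\n",
    "    # Strip any trailing slash so callers can append paths consistently.\n",
    "    return base_url.rstrip(\"/\"), api_key\n",
    "```\n",
    "\n",
    "You could then pass the result to the client, for example `base_url, api_key = load_service_settings()` followed by `LLMClient(base_url=base_url, api_key=api_key, model_id=model_id)`."
   ]
  },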
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Example: Non-Streaming (Full Response)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```text\n",
    "prompt = \"what is ray?\"\n",
    "\n",
    "# --- Get the response without streaming ---\n",
    "response = llm_client.get_response(prompt, temperature=0.5)\n",
    "print(\"\\n\\nModel response (non-streaming):\")\n",
    "print(response)\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```text\n",
    "{\"asctime\": \"2025-05-19 10:01:31,654\", \"levelname\": \"INFO\", \"message\": \"HTTP Request: POST https://llm-service-qwen2p5-32b-jgz99.cld-kvedzwag2qa8i5bj.s.anyscaleuserdata.com/v1/chat/completions \\\"HTTP/1.1 200 OK\\\"\", \"filename\": \"_client.py\", \"lineno\": 1025, \"job_id\": \"02000000\", \"worker_id\": \"02000000ffffffffffffffffffffffffffffffffffffffffffffffff\", \"node_id\": \"7a87cdeb8936fafd92d0d4cab8456af74f2aae665f59cec80664527f\", \"timestamp_ns\": 1747674091654282480}\n",
    "\n",
    "\n",
    "Model response (non-streaming):\n",
    "\"Ray\" can refer to different things depending on the context. Here are a few possibilities:\n",
    "\n",
    "1. **Physics**: In physics, a ray is a line or beam of light, heat, or other form of electromagnetic radiation or particles traveling in a straight line. For example, in optics, rays are used to model the path that light takes.\n",
    "\n",
    "2. **Computer Science**: Ray could refer to \"Ray Tracing,\" a rendering technique used in computer graphics to generate an image by tracing the path of light as pixels in an image plane and simulating the effects of its encounters with virtual objects. It's widely used in video games, movies, and other applications requiring high-quality 3D graphics.\n",
    "\n",
    "3. **Software**: \"Ray\" can also refer to an open-source distributed computing framework developed by the RISELab at UC Berkeley. This framework is designed to enable the development of scalable and high-performance applications, particularly in the fields of machine learning, reinforcement learning, and other data-intensive applications.\n",
    "\n",
    "4. **Brand or Product Name**: \"Ray\" could also be part of a brand name or product name, such as Ray-Ban sunglasses or other consumer products.\n",
    "\n",
    "If you're asking about a specific context or application of \"Ray,\" please provide more details so I can give you a more accurate answer.\n",
    "```"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
